An old college friend emailed me this morning with a query from friends of his: If you only had a limited number of tests available for a virus, such as the coronavirus, how should you pool samples to make the most effective use of the tests? This would involve combining samples into groups and testing each group as a whole; if a group tests negative, you're done with it, but if it tests positive, you would test each sample in it separately. The question is, given the prevalence of the virus in the population, what is the optimal number of samples to group together (so that the expected number of tests used is as small as possible)?
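To make the procedure concrete, here is a minimal sketch in Python that counts the tests used by this two-stage scheme. The function names (`count_tests`, `simulate`) and the parameters are made up for illustration, and the sketch assumes a perfect test: a pool is positive exactly when at least one sample in it is positive, and samples are positive independently of one another.

```python
import random

def count_tests(samples, group_size):
    """Count tests used by two-stage pooling.

    `samples` is a list of booleans (True = positive). Assumes a perfect
    test: a pool is positive exactly when it contains a positive sample.
    """
    tests = 0
    for start in range(0, len(samples), group_size):
        group = samples[start:start + group_size]
        tests += 1               # one pooled test for the whole group
        if any(group):
            tests += len(group)  # positive pool: test every sample separately
    return tests

def simulate(population, prevalence, group_size, seed=0):
    """Draw a random population (independent samples) and count the tests."""
    rng = random.Random(seed)
    samples = [rng.random() < prevalence for _ in range(population)]
    return count_tests(samples, group_size)

# 20 samples in groups of 5, with a single positive in the first group:
# 4 pooled tests plus 5 individual retests.
print(count_tests([True] + [False] * 19, 5))  # 9
```

Running `simulate` with different group sizes at a fixed prevalence gives a rough feel for the trade-off: larger groups mean fewer pooled tests but more pools that come back positive.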
I spent a few minutes trying to solve the problem, and came up with the following. Suppose the virus has prevalence \(p\), and suppose we put \(n\) samples into a group. I wanted to find the expected number of tests for one group. Only one test is needed when all \(n\) samples are negative. If each sample is positive with probability \(p\), independently of the others, this happens with probability \((1-p)^n\). The probability of at least one positive sample is \(1-(1-p)^n\); in that case, \(n+1\) tests are necessary (the pooled test, and then all \(n\) samples separately). So if \(X\) is the number of tests used for one group, its expected value is \[ \sum_x x \Pr(X=x) = 1\cdot(1-p)^n + (n+1)\left[ 1-(1-p)^n\right] = n+1 - n(1-p)^n .\]
Now if the population to test is \(N\) individuals, there will be \(N/n\) groups, and the total expected number of tests is \[ \frac{N}n \left( n+1 - n(1-p)^n \right) = N\left( 1 + \frac1n - (1-p)^n\right). \] Given \(p\) (and \(N\)), the problem is to choose \(n\) to minimize this expression. Since \(N\) is only a constant factor, the best \(n\) depends on \(p\) alone. This doesn't lend itself to an algebraic solution. I tried \(p=0.01\), or 1% prevalence; trial and error led to \(n=11\) as the best choice.
It was at this point that I decided to look for more information (I had heard of group testing before, so I knew the problem had been studied). A Wikipedia article indicated that the first paper on group testing was published in 1943 by Robert Dorfman. I was pleased that he had the same solution I describe above, in fact with almost the same notation. (See "The Detection of Defective Members of Large Populations.")
This week, research on the prevalence of the coronavirus in the state of Indiana was reported. It turns out the seroprevalence of coronavirus in the population is 2.8%. Dorfman calculated that with a prevalence of 3%, the optimal group size is 6, and that by using this group strategy, you would save about two-thirds of your tests. So instead of needing 6.73 million tests to test the 6.73 million people in Indiana, you would expect to need only about one-third as many, roughly 2.2 million.